A Study of Web Information Extraction Technology Based on Beautiful Soup

Authors

Chunmei Zheng

Guomei He

Zuojie Peng

Abstract

In the context of comparative analysis of common web information retrieval technologies, this article discusses the principles and applications of Beautiful Soup, a vertical information search technology based on DOM tree structure. Supported by actual system examples and centering on the system architecture and core technology, this article discusses how to use Beautiful Soup to conduct deep information retrieval for partially structured webpage data, obtain directional information, reorganize the information, and then send the information to users via text message. The test results demonstrate that the web crawler achieved over 95% accuracy, satisfying the needs for commercial application.
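The workflow described in the abstract (parse a partially structured page, pull out the targeted fields, reorganize them into a short message) might look roughly like the sketch below. It is only an illustration, not the authors' system: the sample HTML, the `listing`, `name`, and `price` class names, the helper functions, and the 160-character limit are all assumptions made for this example, and the actual delivery of the text message is left out.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment. The pages the paper targets are not shown
# in the abstract, so this markup is an assumption for illustration.
SAMPLE_HTML = """
<html><body>
  <div id="nav"><a href="/">Home</a></div>
  <table class="listing">
    <tr><th>Item</th><th>Price</th></tr>
    <tr><td class="name">Widget A</td><td class="price">12.50</td></tr>
    <tr><td class="name">Widget B</td><td class="price">8.00</td></tr>
    <tr><td class="name">Broken row</td></tr>
  </table>
</body></html>
"""


def extract_records(html):
    """Walk the parsed tree and collect rows that carry both target fields."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="listing")
    if table is None:
        return []
    records = []
    for row in table.find_all("tr"):
        name = row.find("td", class_="name")
        price = row.find("td", class_="price")
        # Header rows and incomplete rows lack one of the cells; skip them.
        if name is None or price is None:
            continue
        records.append({
            "name": name.get_text(strip=True),
            "price": float(price.get_text(strip=True)),
        })
    return records


def format_message(records, limit=160):
    """Reorganize the records into one short text, cut to an assumed length limit."""
    body = "; ".join(f"{r['name']}: {r['price']:.2f}" for r in records)
    return body[:limit]


if __name__ == "__main__":
    print(format_message(extract_records(SAMPLE_HTML)))
```

Skipping rows that lack an expected cell, rather than raising an error, is one way to cope with pages that are only partially structured; a production crawler would probably also log the skipped rows.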

Similar resources

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...

EXTRACTION-BASED TEXT SUMMARIZATION USING FUZZY ANALYSIS

Due to the explosive growth of the world-wide web, automatic text summarization has become an essential tool for web users. In this paper we present a novel approach for creating text summaries. Using fuzzy logic and word-net, our model extracts the most relevant sentences from an original document. The approach utilizes fuzzy measures and inference on the extracted textual information from the docu...

The relationship between social problem solving with acceptance and use of Web-based resources in educational-research activities adopting a Technology Acceptance Model (TAM)

Aim: Problem-solving is one of the most important issues in the field of psychology. It seems that solving social problems as an external variable has an important role in the acceptance of information technology. Therefore, the aim of the present study was to investigate the relationship between social problem solving and the use and acceptance of web-based resources in educational-research ac...

Identification and Classification of Desirable Web-Based Services from the Perspective of Website Users of Iran’s Hospitals Based on Kano Model of Customer Satisfaction

Background and Aim: A hospital website is an appropriate system for exchanging information and connecting patients, hospitals and medical staff. The purpose of this study was to identify and classify desirable web-based services in websites of Iran's hospitals based on Kano’s Customer Satisfaction Model. Materials and Methods: This was a survey study. The statistical population of the study co...

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

Volume: 10

Publication date: 2015
